Exploratory Analysis of table actor¶

Filter unwanted columns¶

According to the wiki page, the following columns can be dropped:

  • standard_text_property
  • count_text_property
  • concat_names
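A minimal sketch of this step, assuming the table has been loaded into a pandas DataFrame. The variable name `actor` and the two-row stand-in data are illustrative, not taken from the original notebook.

```python
import pandas as pd

# Columns the wiki page marks as unnecessary (names taken from the list above).
UNWANTED_COLUMNS = ["standard_text_property", "count_text_property", "concat_names"]

# Tiny stand-in for the actor table (assumption: the real data comes from the database).
actor = pd.DataFrame({
    "pk_actor": [24321, 35095],
    "concat_names": ["a", "b"],
    "standard_text_property": ["x", "y"],
    "count_text_property": [1, 2],
})

# errors="ignore" keeps the call safe if one of the columns is already absent.
actor = actor.drop(columns=UNWANTED_COLUMNS, errors="ignore")
print(list(actor.columns))
```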

Table extract¶

index | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time
20302 | 24321 | Actr24321 | Poilblan, Gustave | NaN | 1 | None | NaN | 1 | None | 1 | None | 104.0 | 11.0 | 2009-12-21 17:54:44.000 | 11.0 | 2013-12-18 15:24:16
33234 | 35095 | Actr35095 | Munier, Félicien | 1872.0 | 3 | None | 1926.0 | 1 | None | 1 | None | 104.0 | 11.0 | 2010-05-31 15:19:54.000 | 11.0 | 2013-12-18 15:24:16
58131 | 2004 | Actr2004 | Espinosa, Miguel de | 1580.0 | 3 | None | 1630.0 | 1 | None | 1 | None | 104.0 | 27.0 | 2008-11-09 00:11:18.000 | 50.0 | 2017-11-28 12:12:47
52506 | 48188 | Actr48188 | Triveri, Francesco Antonio - da Biella | 1631.0 | 1 | None | 1697.0 | 1 | None | 1 | None | 104.0 | 30.0 | 2014-03-16 09:18:33.900 | 50.0 | 2016-10-20 11:43:57
23887 | 54356 | Actr54356 | Luc, Jean André de | 1763.0 | None | 2 | 1847.0 | None | 2 | 1 | fb_import_20140911_3187 | 104.0 | 3.0 | 2014-09-11 22:23:09.190 | NaN | 2014-09-12 12:22:41

Filter only wanted rows¶

Some rows have been identified as not to be imported (see this wiki page).

Rows number before filter: 61556
Rows number after filter: 59526 (2030 have been removed)

Filter by Actor type¶

For now we are interested only in persons.

Persons are the rows whose fk_abob_type_actor column equals 104.

Number of not 104 actors: 3

index | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time
10340 | 59031 | Actr59031 | Forster, James | 1830.0 | 3 | 3 | 1930.0 | 3 | 3 | 1 | None | 106.0 | 81.0 | 2016-11-29 11:05:00.060 | 81.0 | 2016-11-29 11:05:00
28940 | 60660 | Actr60660 | Valjean, Jean | 1769.0 | 1 | None | 1833.0 | 1 | None | 1 | None | 106.0 | 122.0 | 2018-10-23 16:48:50.050 | 122.0 | 2018-10-23 16:48:50
46002 | 46914 | Actr46914 | Dieu (conception chrétienne) | NaN | 1 | None | NaN | None | None | 0 | None | 106.0 | 3.0 | 2013-07-04 11:43:15.990 | 3.0 | 2013-12-18 15:24:16
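The person filter described above could be sketched as follows. The DataFrame name `actor` and the sample rows are assumptions for illustration. Only the column name and the value 104 come from the text.

```python
import pandas as pd

PERSON_TYPE = 104  # value of fk_abob_type_actor that marks a person

# Stand-in rows (assumption); the real table has many more columns.
actor = pd.DataFrame({
    "pk_actor": [24321, 59031, 46914],
    "fk_abob_type_actor": [104.0, 106.0, 106.0],
})

is_person = actor["fk_abob_type_actor"] == PERSON_TYPE
non_persons = actor[~is_person]       # rows to inspect before discarding them
persons = actor[is_person].copy()     # copy so later edits do not touch `actor`

print(f"Number of not {PERSON_TYPE} actors: {len(non_persons)}")
```

Note that a missing fk_abob_type_actor compares unequal to 104, so such rows would be counted as non-persons in this sketch.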

Discovery¶

Columns contain:
Total number of rows: 59523
  -             "pk_actor":   0.00% empty - 59523 (100.00%) uniques (eg: 44895; 47015)
  -          "concat_actr":   0.00% empty - 59523 (100.00%) uniques (eg: Actr44895; Actr47015)
  - "concat_standard_name":   0.00% empty - 56550 ( 95.01%) uniques (eg: Sainte-Mar...; Costantino...)
  -           "gender_iso":   0.00% empty -     3 (  0.01%) uniques (eg: 1; 2)
  -        "creation_time":   0.00% empty - 34441 ( 57.86%) uniques (eg: 2012-04-08...; 2013-07-26...)
  -    "modification_time":   0.00% empty - 13973 ( 23.47%) uniques (eg: 2013-12-18...; 2016-10-21...)
  -              "creator":   0.01% empty -    88 (  0.15%) uniques (eg: 43.0; 30.0)
  -             "modifier":   8.92% empty -    85 (  0.14%) uniques (eg: 2.0; 30.0)
  -      "certainty_begin":   9.42% empty -     4 (  0.01%) uniques (eg: 3; 1)
  -        "certainty_end":  14.48% empty -     5 (  0.01%) uniques (eg: 3; None)
  -           "begin_year":  18.56% empty -   847 (  1.42%) uniques (eg: 1870.0; 1506.0)
  -             "end_year":  50.66% empty -   819 (  1.38%) uniques (eg: 1930.0; 1545.0)
  -          "notes_begin":  67.74% empty -     5 (  0.01%) uniques (eg: 3; 2)
  -            "notes_end":  72.41% empty -     6 (  0.01%) uniques (eg: 3; 4)
  -                "notes":  89.85% empty -  6012 ( 10.10%) uniques (eg: <p>Il s'ag...; None)
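A summary of this kind could be produced with a small helper such as the one below. This is a sketch, not the notebook's actual code: the function name `describe_columns` is hypothetical, and it ignores missing values when counting distinct values, which may differ from how the figures above were computed.

```python
import pandas as pd

def describe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return, per column, the share of empty cells and the number of distinct values."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        n_unique = col.nunique(dropna=True)  # missing values are not counted
        rows.append({
            "column": name,
            "pct_empty": 100 * col.isna().mean(),
            "n_unique": n_unique,
            "pct_unique": 100 * n_unique / n_rows,
        })
    # Least-empty columns first, as in the listing above.
    return pd.DataFrame(rows).sort_values("pct_empty", kind="stable").reset_index(drop=True)

# Stand-in data (assumption) to show the call.
sample = pd.DataFrame({
    "pk_actor": [1, 2, 3, 4],
    "gender_iso": [1, 1, 2, 1],
    "end_year": [1926.0, None, None, 1630.0],
})
summary = describe_columns(sample)
print(summary)
```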

Type parsing¶

Based on the summary above, we parse each column into its most meaningful type.
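As an illustration of what such parsing might look like in pandas, the sketch below converts a few columns. The target type chosen for each column is an assumption, since the original does not list them.

```python
import pandas as pd

# Stand-in rows (assumption) with the raw types seen in the table extract.
actor = pd.DataFrame({
    "begin_year": [1872.0, None],
    "creator": [11.0, None],
    "creation_time": ["2010-05-31 15:19:54.000", "2008-11-09 00:11:18.000"],
    "concat_standard_name": ["Munier, Félicien", "Espinosa, Miguel de"],
})

# Nullable integers keep missing values without falling back to floats.
for name in ["begin_year", "creator"]:
    actor[name] = actor[name].astype("Int64")

# Timestamps become real datetimes; names become pandas strings.
actor["creation_time"] = pd.to_datetime(actor["creation_time"])
actor["concat_standard_name"] = actor["concat_standard_name"].astype("string")

print(actor.dtypes)
```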

Columns analysis¶

Here we report noteworthy findings for individual columns; the analysis is not exhaustive.

For some columns, we also update the values.

gender_iso¶

Some gender values are undefined. According to the ISO standard (presumably ISO/IEC 5218, in which 0 means "not known"), the value should be 0, 1, 2 or 9, so we replace undefined genders with 0.
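One way to apply this rule is sketched below. It assumes that "undefined" covers both missing cells and any value outside the four allowed codes. The sample values are made up for illustration.

```python
import pandas as pd

VALID_GENDER_CODES = [0, 1, 2, 9]

# Stand-in values (assumption): numbers, a numeric string, a missing cell, junk text.
gender = pd.Series([1, 2, None, "1", "undefined"], dtype="object")

# Coerce to numbers; anything unparsable becomes NaN.
numeric = pd.to_numeric(gender, errors="coerce")

# Keep valid codes and replace everything else (including NaN) with 0.
cleaned = numeric.where(numeric.isin(VALID_GENDER_CODES), 0).astype(int)
print(cleaned.tolist())
```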

certainty_begin¶

We replace missing values with 0.
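A minimal sketch of this replacement, assuming a DataFrame named `actor` (hypothetical). The same loop also covers certainty_end, which receives the same treatment.

```python
import pandas as pd

# Stand-in values (assumption); None marks a cell that was never filled in.
actor = pd.DataFrame({
    "certainty_begin": [1, 3, None],
    "certainty_end": [None, 1, 2],
})

for name in ["certainty_begin", "certainty_end"]:
    # Fill the gaps with 0, then store the column as plain integers.
    actor[name] = actor[name].fillna(0).astype(int)

print(actor)
```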

begin_year¶

certainty_end¶

We replace missing values with 0.

end_year¶

creation_time¶

creator¶

notes¶

All HTML tags, non-ASCII characters and newlines are removed.
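The cleaning could look roughly like the sketch below, which uses only the standard library. The function name `clean_note` is hypothetical. Replacing newlines with a space (so that adjacent words do not merge) and collapsing repeated whitespace are assumptions beyond what the text states, and HTML entities such as `&nbsp;` are left untouched.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, adequate for simple markup

def clean_note(note):
    """Strip HTML tags, newlines and non-ASCII characters from a note."""
    if note is None:
        return None
    text = TAG_RE.sub("", note)                          # drop HTML tags
    text = re.sub(r"[\r\n]+", " ", text)                 # newlines -> single space
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    return " ".join(text.split())                        # collapse leftover whitespace

print(clean_note("<p>Il s'agit d'un <b>élève</b>\nde Paris</p>"))
```

Dropping non-ASCII characters is lossy for accented text (here "élève" becomes "lve"), so transliteration may be preferable if the notes need to stay readable.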